Check Your Chances To Get A Loan.¶

Hi, this is Afroz Samee. Thank you for your interest in reading this notebook. In this notebook, I walked you through the process I followed to predict whether a loan was granted successfully or not. I used the logistic regression machine learning algorithm and the Python libraries Pandas and NumPy for data cleaning. For data visualization, I worked with Matplotlib and Seaborn, and for more interactive analysis, I used Plotly.

This may sound technical and new to some of you, but trust me, it isn’t. By the end of this notebook, you will have a clearer understanding of core machine learning techniques.

How is this Notebook Different?¶

The main question is: how is my notebook different from others available online? Simply put, the uniqueness of this notebook lies in the fact that it not only explains what steps were taken to achieve the result but also why these steps were necessary. Additionally, I provided brief explanations for various functions and comparisons of which techniques are better suited for specific use cases. There are, no doubt, numerous tutorials on loan prediction, especially on Kaggle; however, many of them offer limited explanations and, at times, an overuse of graphs without much added value. For example, notebook1 and notebook2 are primarily filled with graphs without sufficient context.

While the prediction accuracy of this model may not be perfect, I believe this notebook serves as a helpful guide to creating and improving a machine learning model. It demonstrates the steps involved in modeling, offers a framework for building meaningful explanations, and encourages further exploration of algorithms to compare their accuracies and false-positive rates.

Let me briefly outline the steps I followed in this notebook:

  1. Importing Required Libraries
  2. Loading and Understanding Your Dataset
  3. Data Cleaning
  4. Exploring Data Types & Values of Columns
  5. Exploratory Data Analysis
  6. Preprocessing the Data
  7. Feature Engineering and Scaling Techniques
  8. Understanding the Correlation Between Columns
  9. Applying the Machine Learning Model
  10. Prediction Summary
  11. References
In [17]:
#extracting the zip file
from zipfile import ZipFile
file_name = 'playground-series-s4e10.zip'

with ZipFile(file_name, 'r') as archive:  # 'archive' avoids shadowing the built-in zip()
    archive.extractall()
    print('completed')
completed

Importing and Setting Up the Required Libraries¶

Below are the Python libraries I used for data analysis, manipulation, and training:

  • Pandas: This library was used for data manipulation and analysis, especially when working with dataframes—2D tabular data with labeled rows and columns. It also provided functions for handling files, dealing with missing data, removing duplicates, grouping rows or columns, and applying aggregate functions.

  • NumPy: This library provided numerical computation capabilities, particularly for working with multi-dimensional arrays and applying mathematical functions.

  • Matplotlib: I used this library to create static 2D visualizations such as line plots, bar charts, histograms, and scatter plots.

  • Seaborn: Built on top of Matplotlib, Seaborn was used to visualize statistical data. In this notebook, I worked with box plots, heatmaps, and bar plots, among others.

  • Plotly: Plotly offered interactive 2D and 3D plotting and dynamic visualizations, allowing me to create animated charts, sliders, and fully interactive dashboards. It supports large datasets and comes with features like zooming, hovering, and more.

  • Scikit-learn (sklearn): Built on top of NumPy, SciPy, and Matplotlib, this library provided efficient tools for data analysis. I used it for logistic regression, data preprocessing techniques such as scaling and encoding, and model evaluation with metrics like accuracy score and cross-validation.

Tip: "Import only those libraries that will be used in the notebook, and import all the libraries in one code block for clarity and better understanding."
In [12]:
#Import and setup required libraries or packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots 
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, accuracy_score, classification_report

from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings("ignore")

Loading and Exploring the Datasets¶

Data Files¶

  • Training Data: This dataset contained both the independent variables (features) and the dependent variable (target). The machine learning models were trained on this data, learning patterns that would be applied later to the testing data to predict the target variable.

  • Testing Data: This dataset consisted only of the independent variables, meaning it contained all the feature columns without the target variable.

  • Reference: Kaggle's Loan Approval Prediction dataset (Playground Series - Season 4, Episode 10)

Knowing the Datasets¶

Before applying any machine learning algorithm, it was essential to follow some initial steps. The first and most critical step was understanding the dataset thoroughly. This involved knowing its shape, the data types of each feature, whether the target variable was numerical or categorical, and gaining a clear understanding of what you would be working on.

In [19]:
#Read the files into dataframes using pandas, which is imported as 'pd' in this notebook
trainLoan_df = pd.read_csv('train.csv')
testLoan_df = pd.read_csv('test.csv')
#.head() displays the first 5 rows of a dataframe
display(trainLoan_df.head())
display(testLoan_df.head())
id person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
0 0 37 35000 RENT 0.0 EDUCATION B 6000 11.49 0.17 N 14 0
1 1 22 56000 OWN 6.0 MEDICAL C 4000 13.35 0.07 N 2 0
2 2 29 28800 OWN 8.0 PERSONAL A 6000 8.90 0.21 N 10 0
3 3 30 70000 RENT 14.0 VENTURE B 12000 11.11 0.17 N 5 0
4 4 22 60000 RENT 2.0 MEDICAL A 6000 6.92 0.10 N 3 0
id person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 58645 23 69000 RENT 3.0 HOMEIMPROVEMENT F 25000 15.76 0.36 N 2
1 58646 26 96000 MORTGAGE 6.0 PERSONAL C 10000 12.68 0.10 Y 4
2 58647 26 30000 RENT 5.0 VENTURE E 4000 17.19 0.13 Y 2
3 58648 33 50000 RENT 4.0 DEBTCONSOLIDATION A 7000 8.90 0.14 N 7
4 58649 26 102000 MORTGAGE 8.0 HOMEIMPROVEMENT D 15000 16.32 0.15 Y 4
In [21]:
#.shape gives the number of rows and columns in the dataframe
print("Train Dataset shape:",trainLoan_df.shape)
print("Test Dataset shape:",testLoan_df.shape)
Train Dataset shape: (58645, 13)
Test Dataset shape: (39098, 12)
In [23]:
#.info() gives a concise summary of the dataframe
trainLoan_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58645 entries, 0 to 58644
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          58645 non-null  int64  
 1   person_age                  58645 non-null  int64  
 2   person_income               58645 non-null  int64  
 3   person_home_ownership       58645 non-null  object 
 4   person_emp_length           58645 non-null  float64
 5   loan_intent                 58645 non-null  object 
 6   loan_grade                  58645 non-null  object 
 7   loan_amnt                   58645 non-null  int64  
 8   loan_int_rate               58645 non-null  float64
 9   loan_percent_income         58645 non-null  float64
 10  cb_person_default_on_file   58645 non-null  object 
 11  cb_person_cred_hist_length  58645 non-null  int64  
 12  loan_status                 58645 non-null  int64  
dtypes: float64(3), int64(6), object(4)
memory usage: 5.8+ MB
In [25]:
trainLoan_df.describe().T
Out[25]:
count mean std min 25% 50% 75% max
id 58645.0 29322.000000 16929.497605 0.00 14661.00 29322.00 43983.00 58644.00
person_age 58645.0 27.550857 6.033216 20.00 23.00 26.00 30.00 123.00
person_income 58645.0 64046.172871 37931.106979 4200.00 42000.00 58000.00 75600.00 1900000.00
person_emp_length 58645.0 4.701015 3.959784 0.00 2.00 4.00 7.00 123.00
loan_amnt 58645.0 9217.556518 5563.807384 500.00 5000.00 8000.00 12000.00 35000.00
loan_int_rate 58645.0 10.677874 3.034697 5.42 7.88 10.75 12.99 23.22
loan_percent_income 58645.0 0.159238 0.091692 0.00 0.09 0.14 0.21 0.83
cb_person_cred_hist_length 58645.0 5.813556 4.029196 2.00 3.00 4.00 8.00 30.00
loan_status 58645.0 0.142382 0.349445 0.00 0.00 0.00 0.00 1.00

Data Cleaning¶

  1. Checking for Null Values: The first step was to check if there were any null values in the datasets. In my case, neither dataset had any null values to handle.

  2. Handling Irrelevant Information: It was important to identify and remove any irrelevant or incorrect information early to prevent the model from overfitting. For example, in my dataset, one person’s age was listed as 123 years, which was clearly unrealistic, and since it was the only instance, I decided to drop that row. Similarly, another entry had an employment length greater than the person’s age, which is impossible.

Example: When dealing with such issues, first check whether a sensible majority value exists to fill in the bad entries; if only a few rows are problematic, it is usually better to drop them.
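The fill-vs-drop decision above can be sketched on toy values (hypothetical data, not the actual dataset):

```python
import pandas as pd

# Toy frame with one clearly invalid age (hypothetical values)
df = pd.DataFrame({'person_age': [25, 30, 123, 28],
                   'person_income': [40000, 52000, 61000, 47000]})

# Option 1: only a few bad rows -> drop them
dropped = df[df['person_age'] != 123]

# Option 2: many bad rows -> fill with a representative value (here, the median of the valid ages)
filled = df.copy()
valid_median = df.loc[df['person_age'] != 123, 'person_age'].median()
filled.loc[filled['person_age'] == 123, 'person_age'] = valid_median
```

In this notebook the invalid rows were so few that dropping was the simpler and safer choice.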
In [28]:
print(f"Train data missing values if any: \n{trainLoan_df.isnull().sum()}")
Train data missing values if any: 
id                            0
person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
loan_status                   0
dtype: int64
In [30]:
print(f"Test data missing values if any: \n{testLoan_df.isnull().sum()}")
Test data missing values if any: 
id                            0
person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
dtype: int64
In [32]:
#drop the rows if persons age is 123 
print(f"Rows in dataframe having age equal to 123:\n{(trainLoan_df['person_age'] == 123).sum()}")
trainLoan_df = trainLoan_df[trainLoan_df['person_age'] != 123]
trainLoan_df.shape
Rows in dataframe having age equal to 123:
1
Out[32]:
(58644, 13)
In [34]:
#drop the rows if persons employment is equal to 123
print(f"Rows in dataframe having person_emp_length equal to 123:\n{(trainLoan_df['person_emp_length'] == 123).sum()}")

trainLoan_df = trainLoan_df[trainLoan_df['person_emp_length'] != 123]
trainLoan_df.shape
Rows in dataframe having person_emp_length equal to 123:
2
Out[34]:
(58642, 13)
In [36]:
#Select rows where 'person_emp_length' is less than 'person_age', dropping rows where the employment length exceeds the person's age.
print(f"Rows in dataframe having employment length greater than the person's age:\n{(trainLoan_df['person_emp_length'] > trainLoan_df['person_age']).sum()}")
condition = trainLoan_df['person_emp_length'] < trainLoan_df['person_age']
trainLoan_df = trainLoan_df.loc[condition]
trainLoan_df.shape
Rows in dataframe having employment length greater than the person's age:
0
Out[36]:
(58642, 13)
In [38]:
#Select rows where 'cb_person_cred_hist_length' is less than 'person_age', dropping rows where the credit history is longer than the person's age.
print(f"Rows in dataframe having credit history longer than the person's age:\n{(trainLoan_df['cb_person_cred_hist_length'] > trainLoan_df['person_age']).sum()}")
condition = trainLoan_df['cb_person_cred_hist_length'] < trainLoan_df['person_age']
trainLoan_df = trainLoan_df.loc[condition]
trainLoan_df.shape
Rows in dataframe having credit history longer than the person's age:
1
Out[38]:
(58641, 13)

Extracting Unique Values and Data Types¶

Extracting unique values and identifying data types early in the analysis offers several key advantages. This step allows us to identify categorical columns, which is essential for choosing the right encoding techniques. For instance, in this notebook, I apply Label Encoding and pd.get_dummies (similar to one-hot encoding) based on the unique values within each categorical column.

Understanding whether a categorical column is related to the target variable is also important. If a categorical column has numerous unique values but shows no significant relationship with the target, we might consider dimensionality reduction techniques to simplify the data. This process helps manage and balance the dataset more effectively, enabling the selection of encoding methods that preserve interpretability.

This approach also allows flexibility, enabling us to revisit and adjust encoding techniques as the analysis evolves.
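As a sketch of one way to tame a high-cardinality categorical column (not needed for this dataset, whose columns have few unique values): rare categories can be lumped into a single bucket before encoding, keeping the dummy-column count manageable.

```python
import pandas as pd

# Hypothetical categorical column with a couple of rare values
s = pd.Series(['RENT', 'RENT', 'OWN', 'MORTGAGE', 'RENT', 'OWN', 'LEASE', 'SQUAT'])

# Keep categories covering at least 20% of rows; lump the rest into one bucket
freq = s.value_counts(normalize=True)
keep = freq[freq >= 0.20].index
reduced = s.where(s.isin(keep), 'RARE')
```

After this step, one-hot encoding `reduced` yields three columns instead of five.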

In [41]:
non_numerical_columns = trainLoan_df.select_dtypes(include=['object']).columns.tolist()
for col in non_numerical_columns:
    print(f"Column: {col}")
    print(f"Unique Values: {trainLoan_df[col].unique()}")
    print("\n")
Column: person_home_ownership
Unique Values: ['RENT' 'OWN' 'MORTGAGE' 'OTHER']


Column: loan_intent
Unique Values: ['EDUCATION' 'MEDICAL' 'PERSONAL' 'VENTURE' 'DEBTCONSOLIDATION'
 'HOMEIMPROVEMENT']


Column: loan_grade
Unique Values: ['B' 'C' 'A' 'D' 'E' 'F' 'G']


Column: cb_person_default_on_file
Unique Values: ['N' 'Y']


In [43]:
non_numerical_columns = testLoan_df.select_dtypes(include=['object']).columns.tolist()
for col in non_numerical_columns:
    print(f"Column: {col}")
    print(f"Unique Values: {testLoan_df[col].unique()}")
    print("\n")
Column: person_home_ownership
Unique Values: ['RENT' 'MORTGAGE' 'OWN' 'OTHER']


Column: loan_intent
Unique Values: ['HOMEIMPROVEMENT' 'PERSONAL' 'VENTURE' 'DEBTCONSOLIDATION' 'EDUCATION'
 'MEDICAL']


Column: loan_grade
Unique Values: ['F' 'C' 'E' 'A' 'D' 'B' 'G']


Column: cb_person_default_on_file
Unique Values: ['N' 'Y']


Exploratory Data Analysis¶

In [46]:
def plot_countplot(data, column):
    sns.countplot(data=data, x=column, palette="Set2", hue=column)
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.title('Representation of Approved Loans over Rejected')
    sns.despine()
    plt.show()

plot_countplot(trainLoan_df, 'loan_status')
[Figure: countplot of loan_status, approved vs rejected loans]
In [48]:
def numericalData_histPlot(df, feature, target):
    
    fig = px.histogram(df, x = feature, color= target,
                        title = f'Distribution of {feature} by loan Status',
                        labels={feature: feature, 'count': 'Count', target: target},
                        hover_data={feature: True, target: True})
    
    
    # Update the layout for better visuals
    fig.update_layout(
        xaxis_title=feature,
        yaxis_title='Count',
        bargap=0.1,  # Adjusts the gap between bars
        hovermode="x unified",  # Display hover data for both colors (stacked bars)
        showlegend=True,
        title_x=0.5, width=900, height=500
    )
    
    # Show the figure
    
    fig.show()


numericalData_histPlot(trainLoan_df, 'person_age', 'loan_status')

numericalData_histPlot(trainLoan_df, 'loan_amnt', 'loan_status')
In [49]:
def plot_categorical_data_interactive(df, feature1, target):
    """Creating a bar graph and a pie chart using plotly express and implementing sub graphs using subplotting module from plotly"""
    fig = make_subplots(rows=1, cols=2, 
                        subplot_titles=(f'Countplot of {feature1} by {target}', f'Pie chart of {feature1}'),
                        specs=[[{"type": "xy"}, {"type": "domain"}]]) 

    # Plotting bar graph using plotly express
    bar = px.histogram(df, x=feature1, color=target, barmode='group',
                           labels={feature1: feature1, 'count': 'Count'},
                           color_discrete_sequence=px.colors.qualitative.Set1)
    for trace in bar['data']:
        fig.add_trace(trace, row=1, col=1)

    # plotting pie chart using graph objects module from plotly 
    pie = df[feature1].value_counts()
    pie_fig = go.Pie(labels=pie.index, values=pie.values,
                     marker=dict(colors=px.colors.qualitative.Set1), 
                     hole=0.3)
    fig.add_trace(pie_fig, row=1, col=2)

    # complete layout
    fig.update_layout(title_text=f'Comparison of {feature1} by {target}', showlegend=True,
                      title_x=0.5, width=1000, height=500)

    fig.show()

plot_categorical_data_interactive(trainLoan_df, 'person_home_ownership', 'loan_status')
plot_categorical_data_interactive(trainLoan_df, 'loan_intent', 'loan_status')

Data Preprocessing Techniques¶

Here, I’m using two primary data preprocessing methods:

Label Encoding¶

Label encoding assigns a unique numerical label to each unique value in a categorical attribute. For instance, in our dataset, the attribute cb_person_default_on_file has values N and Y, which are transformed to 0 and 1 respectively.
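A minimal sketch of what LabelEncoder does on a Y/N column (labels are assigned in sorted order of the classes):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(pd.Series(['N', 'Y', 'N', 'Y']))
# Classes are sorted alphabetically, so 'N' maps to 0 and 'Y' to 1
```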

Get Dummies (One-Hot Encoding)¶

This method is useful when there is no inherent order or priority among categories. One-hot encoding creates a binary (0 or 1) column for each unique value in the feature, where 1 indicates the presence of that value and 0 indicates absence.

Note: If the feature has many unique values, this approach can lead to a large number of columns, which increases memory use and may slow down the model. The pd.get_dummies() method in Pandas is similar to OneHotEncoder in Scikit-learn.
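A small sketch of pd.get_dummies on a toy loan_intent column, showing the one-binary-column-per-value layout:

```python
import pandas as pd

# Toy version of the loan_intent column
df = pd.DataFrame({'loan_intent': ['EDUCATION', 'MEDICAL', 'EDUCATION']})
dummies = pd.get_dummies(df['loan_intent'])
# Two unique values -> two binary columns; each row has exactly one 'present' flag
```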

In [51]:
def preprocessing_trainingDataSet(df):
    '''This function performs preprocessing using label encoding and the `get_dummies` method from pandas, 
    which is similar to one-hot encoding.'''

    le = LabelEncoder()
    
    df['cb_person_default_on_file'] = le.fit_transform(df['cb_person_default_on_file'])
    df['loan_grade'] = le.fit_transform(df['loan_grade'])
    
    loan_intent = pd.get_dummies(df['loan_intent'])
    person_home_ownership = pd.get_dummies(df['person_home_ownership'])
    
    df.drop(['loan_intent', 'person_home_ownership'], axis=1, inplace=True)
    print(f'Shape of the dataset after dropping the object-type columns: {df.shape}')
    
    df_onehotEncoding = pd.concat([loan_intent, person_home_ownership], axis=1)
    print(f'Shape of the one-hot encoded (get_dummies) columns: {df_onehotEncoding.shape}')
    
    df = pd.concat([df, df_onehotEncoding], axis=1)
    print(f'Shape of the dataset after adding the preprocessed columns: {df.shape}')
    
    return df
    
trainLoan_df = preprocessing_trainingDataSet(trainLoan_df)   
display(trainLoan_df.head())
Shape of the dataset after dropping the object-type columns: (58641, 11)
Shape of the one-hot encoded (get_dummies) columns: (58641, 10)
Shape of the dataset after adding the preprocessed columns: (58641, 21)
id person_age person_income person_emp_length loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length ... DEBTCONSOLIDATION EDUCATION HOMEIMPROVEMENT MEDICAL PERSONAL VENTURE MORTGAGE OTHER OWN RENT
0 0 37 35000 0.0 1 6000 11.49 0.17 0 14 ... False True False False False False False False False True
1 1 22 56000 6.0 2 4000 13.35 0.07 0 2 ... False False False True False False False False True False
2 2 29 28800 8.0 0 6000 8.90 0.21 0 10 ... False False False False True False False False True False
3 3 30 70000 14.0 1 12000 11.11 0.17 0 5 ... False False False False False True False False False True
4 4 22 60000 2.0 0 6000 6.92 0.10 0 3 ... False False False True False False False False False True

5 rows × 21 columns

In [52]:
#preprocessing testing data set
testLoan_df = preprocessing_trainingDataSet(testLoan_df)   
display(testLoan_df.head())
Shape of the dataset after dropping the object-type columns: (39098, 10)
Shape of the one-hot encoded (get_dummies) columns: (39098, 10)
Shape of the dataset after adding the preprocessed columns: (39098, 20)
id person_age person_income person_emp_length loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length DEBTCONSOLIDATION EDUCATION HOMEIMPROVEMENT MEDICAL PERSONAL VENTURE MORTGAGE OTHER OWN RENT
0 58645 23 69000 3.0 5 25000 15.76 0.36 0 2 False False True False False False False False False True
1 58646 26 96000 6.0 2 10000 12.68 0.10 1 4 False False False False True False True False False False
2 58647 26 30000 5.0 4 4000 17.19 0.13 1 2 False False False False False True False False False True
3 58648 33 50000 4.0 0 7000 8.90 0.14 0 7 True False False False False False False False False True
4 58649 26 102000 8.0 3 15000 16.32 0.15 1 4 False False True False False False True False False False

Data Splitting and Scaling Techniques¶

Feature Scaling¶

In machine learning, features are mapped into n-dimensional space. If one variable (e.g., y) has much larger values than another (e.g., x), the Euclidean distance will be dominated by the larger variable, leading to potential loss of important information. Feature scaling solves this problem by normalizing or standardizing the data.
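A quick numeric illustration of that dominance, using made-up applicant values and hypothetical feature ranges:

```python
import numpy as np

# Two applicants described by income (tens of thousands) and interest rate (small numbers)
a = np.array([35000.0, 11.5])
b = np.array([56000.0, 13.3])

# Unscaled, the income gap dominates the Euclidean distance almost entirely
d_raw = float(np.linalg.norm(a - b))

# After dividing each feature by a (hypothetical) range, both dimensions contribute comparably
ranges = np.array([100000.0, 20.0])
d_scaled = float(np.linalg.norm((a - b) / ranges))
```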

Reasons for Feature Scaling:¶

  1. To better approximate a theoretical distribution with desirable statistical properties.
  2. To spread out data more evenly.
  3. To make data distribution more symmetric.
  4. To linearize relationships between variables.
  5. To ensure constant variance (homoscedasticity).

RobustScaler¶

RobustScaler is a scaling technique that uses the interquartile range (IQR) and the median to scale features. It's useful for datasets with outliers, as it makes scaling more robust and reliable.

Formula:

x_scaled = (x - median(x)) / IQR(x)

Where:

  • median(x): The median of the feature.
  • IQR(x): The interquartile range (75th percentile - 25th percentile).
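The scaler can be checked against this formula by hand, on toy data with a deliberate outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier

scaled = RobustScaler().fit_transform(x)

# Manual version of the same formula: (x - median) / IQR
median = np.median(x)
q75, q25 = np.percentile(x, [75, 25])
manual = (x - median) / (q75 - q25)
```

Because the median and IQR barely move when an outlier is added, the outlier does not distort the scaling of the other values, which is exactly the robustness the section describes.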
In [58]:
X = trainLoan_df.drop(['id','loan_status'],axis=1)
y = trainLoan_df['loan_status']
In [60]:
# Split the data into training and validation sets (80% train, 20% validation)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [62]:
# Apply scaling method
scaler = RobustScaler()

# fit the training data and transform both testing and training data
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Correlation¶

Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient quantifies this relationship, ranging from -1 to 1. There are three types of correlation:

  1. Positive Correlation: Two variables move in the same direction (directly proportional); a perfect positive correlation has a coefficient of +1.
  2. Negative Correlation: Two variables move in opposite directions (inversely proportional); a perfect negative correlation has a coefficient of -1.
  3. No Correlation: No linear relationship between the variables; the coefficient is close to 0.

By using a correlation matrix (corr()), we can identify which features have a strong relationship with the target variable. Features with no significant relationship can be dropped. However, since the dataset had relatively few features, I chose to keep all of them for training.
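A tiny sketch of the three cases with pandas' corr() on made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],   # moves with x    -> coefficient near +1
    'z': [10, 8, 6, 4, 2],   # moves against x -> coefficient near -1
})
corr = df.corr()
```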

In [65]:
# Select all columns except 'loan_status' and 'id'
df_corr = trainLoan_df.loc[:, ~trainLoan_df.columns.isin(['loan_status','id'])]


plt.figure(figsize=(19, 8))
sns.heatmap(df_corr.corr(), fmt = '.1f', cmap="coolwarm", annot=True)
plt.title('Correlation Matrix')
plt.show()
[Figure: correlation matrix heatmap of the features]

Logistic Regressor¶

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is binary. Like all regression analyses, logistic regression is a predictive analysis: it describes data and explains the relationship between one dependent binary variable and one or more independent variables. In short, logistic regression is used when the dependent variable (target) is categorical.

For example:

  • To predict whether an email is spam (1) or not (0).
  • Whether an online transaction is fraudulent (1) or not (0).
  • Whether a loan is granted (1) or not (0).
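Under the hood, logistic regression passes a linear score through the sigmoid (logistic) function to obtain a probability; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# A linear score of 0 corresponds to a 50% predicted probability;
# large positive scores approach 1 (granted), large negative scores approach 0 (rejected)
```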
In [68]:
# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
print("Training Logistic Regression...")
model.fit(x_train, y_train)

# Make predictions
y_pred = model.predict(x_test)
y_prob = model.predict_proba(x_test)[:, 1]
Training Logistic Regression...


Prediction Summary¶


Confusion Matrix¶

A confusion matrix is used only for classification tasks. It summarizes the performance of a classification model by comparing actual and predicted labels. The matrix consists of the following four metrics:

                Predicted True         Predicted False
Actual True     True Positive (TP)     False Negative (FN)
Actual False    False Positive (FP)    True Negative (TN)

Accuracy¶

The accuracy of a classification model is calculated as the ratio of correctly predicted observations (both true positives and true negatives) to the total number of observations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
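Both the confusion-matrix cells and the accuracy definition can be verified on a handful of toy labels (not the model's actual predictions). Note that sklearn lays the matrix out as [[TN, FP], [FN, TP]], with actual labels on the rows and predicted labels on the columns:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# sklearn orders the cells as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)  # matches accuracy_score
```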

In [72]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

# Display evaluation metrics
print(f"\nAccuracy: {accuracy:.2f}")
print(f"AUC: {auc:.2f}")
print("\nClassification Report:\n", report)

# Plot confusion matrix
TFmatrix = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
TFmatrix.plot(cmap='RdPu')
plt.title('Confusion Matrix - Logistic Regression')
plt.show()
Accuracy: 0.90
AUC: 0.89

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.98      0.94     10080
           1       0.75      0.45      0.56      1649

    accuracy                           0.90     11729
   macro avg       0.83      0.71      0.75     11729
weighted avg       0.89      0.90      0.89     11729

[Figure: confusion matrix for the logistic regression model]
In [74]:
# Note: this plot is illustrative only; the model has many features, so we plot just the first two scaled features
plt.figure(figsize=(8, 6))

# Plot the test data points, colored by their actual labels
plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test, cmap='viridis', edgecolor='k', alpha=0.6, label="Actual")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Actual vs Predicted Outcomes")

# Decision boundary restricted to the first two coefficients (approximate in this 2D view)
x_vals = np.linspace(x_test[:, 0].min(), x_test[:, 0].max(), 100)
y_vals = -(model.coef_[0][0] * x_vals + model.intercept_[0]) / model.coef_[0][1]
plt.plot(x_vals, y_vals, color="red", linewidth=2, label="Decision Boundary")

plt.legend()
plt.show()
[Figure: actual outcomes in the first two scaled features, with the projected decision boundary]
In [76]:
# Predict the target values for the test dataset (drop the 'id' column first)
x_testPredict = testLoan_df.drop(['id'], axis=1)
x_testPredict = scaler.transform(x_testPredict)
test_predictions = model.predict(x_testPredict)

# Add the predicted values to your test dataset
testLoan_df['Predicted_Loan_Status'] = test_predictions
In [78]:
print(testLoan_df['Predicted_Loan_Status'].unique())
[1 0]

References¶

  • Kaggle's Loan Approval Prediction dataset (Playground Series - Season 4, Episode 10)
  • I used this notebook tutorial for initial ideas on the layout of this notebook
  • I used this Kaggle notebook to build an understanding of encoding techniques
  • I used this notebook as an example reference
  • To get a brief understanding of notebooks, I read up on Kaggle and applied what I learned
  • Some of the content is adapted from my previous work
In [2]:
!jupyter nbconvert --to html AI_CW.ipynb
[NbConvertApp] Converting notebook AI_CW.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 4 image(s).
[NbConvertApp] Writing 6270816 bytes to AI_CW.html